{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d95f841a-63c9-41d4-aea1-496b3d2024dd",
   "metadata": {},
   "source": [
    "**LLM Workshop 2024 by Sebastian Raschka**\n",
    "\n",
    "This code is based on *Build a Large Language Model (From Scratch)*, [https://github.com/rasbt/LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25aa40e3-5109-433f-9153-f5770531fe94",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# 2) Understanding LLM Input Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "76d5d2c0-cba8-404e-9bf3-71a218cae3cf",
   "metadata": {},
   "source": [
    "Packages that are being used in this notebook:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4d1305cf-12d5-46fe-a2c9-36fb71c5b3d3",
   "metadata": {},
   "outputs": [],
   "source": [
    "from importlib.metadata import version\n",
    "\n",
    "\n",
    "print(\"torch version:\", version(\"torch\"))\n",
    "print(\"tiktoken version:\", version(\"tiktoken\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5a42fbfd-e3c2-43c2-bc12-f5f870a0b10a",
   "metadata": {},
   "source": [
    "- This notebook provides a brief overview of the data preparation and sampling procedures to get input data \"ready\" for an LLM\n",
    "- Understanding what the input data looks like is a great first step towards understanding how LLMs work"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "628b2922-594d-4ff9-bd82-04f1ebdf41f5",
   "metadata": {},
   "source": [
    "<img src=\"./figures/01.png\" width=\"1000px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eddbb984-8d23-40c5-bbfa-c3c379e7eec3",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# 2.1 Tokenizing text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9c90731-7dc9-4cd3-8c4a-488e33b48e80",
   "metadata": {},
   "source": [
    "- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "09872fdb-9d4e-40c4-949d-52a01a43ec4b",
   "metadata": {},
   "source": [
    "<img src=\"figures/02.png\" width=\"800px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8cceaa18-833d-46b6-b211-b20c53902805",
   "metadata": {},
   "source": [
    "- Load raw text we want to work with\n",
    "- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a769e87-470a-48b9-8bdb-12841b416198",
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(\"the-verdict.txt\", \"r\", encoding=\"utf-8\") as f:\n",
    "    raw_text = f.read()\n",
    "    \n",
    "print(\"Total number of character:\", len(raw_text))\n",
    "print(raw_text[:99])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b971a46-ac03-4368-88ae-3f20279e8f4e",
   "metadata": {},
   "source": [
    "- The goal is to tokenize and embed this text for an LLM\n",
    "- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6cbe9330-b587-4262-be9f-497a84ec0e8a",
   "metadata": {},
   "source": [
    "<img src=\"figures/03.png\" width=\"690px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3daa1687-2c08-485a-87cc-a93c2f9586d7",
   "metadata": {},
   "source": [
    "- The following regular expression will split on whitespaces and punctuation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "737dd5b0-9dbb-4a97-9ae4-3482c8c04be7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "preprocessed = re.split(r'([,.:;?_!\"()\\']|--|\\s)', raw_text)\n",
    "preprocessed = [item for item in preprocessed if item]\n",
    "print(preprocessed[:38])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "35db7b5e-510b-4c45-995f-f5ad64a8e19c",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Number of tokens:\", len(preprocessed))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0b5ce8fe-3a07-4f2a-90f1-a0321ce3a231",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# 2.2 Converting tokens into token IDs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a5204973-f414-4c0d-87b0-cfec1f06e6ff",
   "metadata": {},
   "source": [
    "- Next, we convert the text tokens into token IDs that we can process via embedding layers later\n",
    "- For this we first need to build a vocabulary"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "177b041d-f739-43b8-bd81-0443ae3a7f8d",
   "metadata": {},
   "source": [
    "<img src=\"figures/04.png\" width=\"900px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8eeade64-037b-4b59-9039-d3b000ef8886",
   "metadata": {},
   "source": [
    "- The vocabulary contains the unique words in the input text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fdf0533-5ab6-42a5-83fa-a3b045de6396",
   "metadata": {},
   "outputs": [],
   "source": [
    "all_words = sorted(set(preprocessed))\n",
    "vocab_size = len(all_words)\n",
    "\n",
    "print(vocab_size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "77d00d96-881f-4691-bb03-84fec2a75a26",
   "metadata": {},
   "outputs": [],
   "source": [
    "vocab = {token:integer for integer,token in enumerate(all_words)}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "75bd1f81-3a8f-4dd9-9dd6-e75f32dacbe3",
   "metadata": {},
   "source": [
    "- Below are the first 50 entries in this vocabulary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1c5de4a-aa4e-4aec-b532-10bb364039d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "for i, item in enumerate(vocab.items()):\n",
    "    print(item)\n",
    "    if i >= 50:\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3b1dc314-351b-476a-9459-0ec9ddc29b19",
   "metadata": {},
   "source": [
    "- Below, we illustrate the tokenization of a short sample text using a small vocabulary:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "67407a9f-0202-4e7c-9ed7-1b3154191ebc",
   "metadata": {},
   "source": [
    "<img src=\"figures/05.png\" width=\"800px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e569647-2589-4c9d-9a5c-aef1c88a0a9a",
   "metadata": {},
   "source": [
    "- Let's now put it all together into a tokenizer class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f531bf46-7c25-4ef8-bff8-0d27518676d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "class SimpleTokenizerV1:\n",
    "    def __init__(self, vocab):\n",
    "        self.str_to_int = vocab\n",
    "        self.int_to_str = {i:s for s,i in vocab.items()}\n",
    "    \n",
    "    def encode(self, text):\n",
    "        preprocessed = re.split(r'([,.?_!\"()\\']|--|\\s)', text)\n",
    "        preprocessed = [\n",
    "            item.strip() for item in preprocessed if item.strip()\n",
    "        ]\n",
    "        ids = [self.str_to_int[s] for s in preprocessed]\n",
    "        return ids\n",
    "        \n",
    "    def decode(self, ids):\n",
    "        text = \" \".join([self.int_to_str[i] for i in ids])\n",
    "        # Replace spaces before the specified punctuations\n",
    "        text = re.sub(r'\\s+([,.?!\"()\\'])', r'\\1', text)\n",
    "        return text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dee7a1e5-b54f-4ca1-87ef-3d663c4ee1e7",
   "metadata": {},
   "source": [
    "- The `encode` function turns text into token IDs\n",
    "- The `decode` function turns token IDs back into text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc21d347-ec03-4823-b3d4-9d686e495617",
   "metadata": {},
   "source": [
    "<img src=\"figures/06.png\" width=\"800px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2950a94-6b0d-474e-8ed0-66d0c3c1a95c",
   "metadata": {},
   "source": [
    "- We can use the tokenizer to encode (that is, tokenize) texts into integers\n",
    "- These integers can then be embedded (later) as input of/for the LLM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "647364ec-7995-4654-9b4a-7607ccf5f1e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer = SimpleTokenizerV1(vocab)\n",
    "\n",
    "text = \"\"\"\"It's the last he painted, you know,\" \n",
    "           Mrs. Gisburn said with pardonable pride.\"\"\"\n",
    "ids = tokenizer.encode(text)\n",
    "print(ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3201706e-a487-4b60-b99d-5765865f29a0",
   "metadata": {},
   "source": [
    "- We can decode the integers back into text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01d8c8fb-432d-4a49-b332-99f23b233746",
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer.decode(ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "54f6aa8b-9827-412e-9035-e827296ab0fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer.decode(tokenizer.encode(text))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c4ba34b-170f-4e71-939b-77aabb776f14",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# 2.3 BytePair encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2309494c-79cf-4a2d-bc28-a94d602f050e",
   "metadata": {},
   "source": [
    "- GPT-2 used BytePair encoding (BPE) as its tokenizer\n",
    "- it allows the model to break down words that aren't in its predefined vocabulary into smaller subword units or even individual characters, enabling it to handle out-of-vocabulary words\n",
    "- For instance, if GPT-2's vocabulary doesn't have the word \"unfamiliarword,\" it might tokenize it as [\"unfam\", \"iliar\", \"word\"] or some other subword breakdown, depending on its trained BPE merges\n",
    "- The original BPE tokenizer can be found here: [https://github.com/openai/gpt-2/blob/master/src/encoder.py](https://github.com/openai/gpt-2/blob/master/src/encoder.py)\n",
    "- In this lecture, we are using the BPE tokenizer from OpenAI's open-source [tiktoken](https://github.com/openai/tiktoken) library, which implements its core algorithms in Rust to improve computational performance\n",
    "- (Based on an analysis [here](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb), I found that `tiktoken` is approx. 3x faster than the original tokenizer and 6x faster than an equivalent tokenizer in Hugging Face)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ede1d41f-934b-4bf4-8184-54394a257a94",
   "metadata": {},
   "outputs": [],
   "source": [
    "# pip install tiktoken"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48967a77-7d17-42bf-9e92-fc619d63a59e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import importlib\n",
    "import tiktoken\n",
    "\n",
    "print(\"tiktoken version:\", importlib.metadata.version(\"tiktoken\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ad3312f-a5f7-4efc-9d7d-8ea09d7b5128",
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer = tiktoken.get_encoding(\"gpt2\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ff2cd85-7cfb-4325-b390-219938589428",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = (\n",
    "    \"Hello, do you like tea? <|endoftext|> In the sunlit terraces\"\n",
    "     \"of someunknownPlace.\"\n",
    ")\n",
    "\n",
    "integers = tokenizer.encode(text, allowed_special={\"<|endoftext|>\"})\n",
    "\n",
    "print(integers)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d26a48bb-f82e-41a8-a955-a1c9cf9d50ab",
   "metadata": {},
   "outputs": [],
   "source": [
    "strings = tokenizer.decode(integers)\n",
    "\n",
    "print(strings)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8c2e7b4-6a22-42aa-8e4d-901f06378d4a",
   "metadata": {},
   "source": [
    "- BPE tokenizers break down unknown words into subwords and individual characters:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c082d41f-33d7-4827-97d8-993d5a84bb3c",
   "metadata": {},
   "source": [
    "<img src=\"figures/07.png\" width=\"700px\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0beb27ee-1156-457c-839e-eebb48d94d0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer.encode(\"Akwirw ier\", allowed_special={\"<|endoftext|>\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "abbd7c0d-70f8-4386-a114-907e96c950b0",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# 2.4 Data sampling with a sliding window"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "509d9826-6384-462e-aa8a-a7c73cd6aad0",
   "metadata": {},
   "source": [
    "- Above, we took care of the tokenization (converting text into word tokens represented as token ID numbers)\n",
    "- Now, let's talk about how we create the data loading for LLMs\n",
    "- We train LLMs to generate one word at a time, so we want to prepare the training data accordingly where the next word in a sequence represents the target to predict"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "39fb44f4-0c43-4a6a-9c2f-9cf31452354c",
   "metadata": {},
   "source": [
    "<img src=\"figures/08.png\" width=\"800px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c9a3d50-885b-49bc-b791-9f5cc8bc7b7c",
   "metadata": {},
   "source": [
    "- For this, we use a sliding window approach, changing the position by +1:\n",
    "\n",
    "<img src=\"figures/09.png\" width=\"900px\">"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b006212f-de45-468d-bdee-5806216d1679",
   "metadata": {},
   "source": [
    "- Note that in practice it's best to set the stride equal to the context length so that we don't have overlaps between the inputs (the targets are still shifted by +1 always)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9cb467e0-bdcd-4dda-b9b0-a738c5d33ac3",
   "metadata": {},
   "source": [
    "<img src=\"figures/10.png\" width=\"800px\">"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb55f51a",
   "metadata": {},
   "outputs": [],
   "source": [
    "from supplementary import create_dataloader_v1\n",
    "\n",
    "\n",
    "dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4, shuffle=False)\n",
    "\n",
    "data_iter = iter(dataloader)\n",
    "inputs, targets = next(data_iter)\n",
    "print(\"Inputs:\\n\", inputs)\n",
    "print(\"\\nTargets:\\n\", targets)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2dc671fb-6945-4594-b33f-8b462a69720d",
   "metadata": {},
   "source": [
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "<br>\n",
    "\n",
    "# Exercise: Prepare your own favorite text dataset"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
